L6: Entry 9 of 11 File: USPT Sep 4, 2001



DOCUMENT-IDENTIFIER: US 6285999 B1

TITLE: Method for node ranking in a linked database 



Abstract Text (1) : 

A method assigns importance ranks to nodes in a linked database, such as any 
database of documents containing citations, the world wide web or any other 
hypermedia database. The rank assigned to a document is calculated from the ranks 
of documents citing it. In addition, the rank of a document is calculated from a 
constant representing the probability that a browser through the database will 
randomly jump to the document. The method is particularly useful in enhancing the 
performance of search engine results for hypermedia databases, such as the world 
wide web, whose documents have a large variation in quality. 

Brief Summary Text (2) : 

This invention relates generally to techniques for analyzing linked databases. More
particularly, it relates to methods for assigning ranks to nodes in a linked 
database, such as any database of documents containing citations, the world wide 
web or any other hypermedia database. 

Brief Summary Text (11) : 

Various aspects of the present invention provide systems and methods for ranking 
documents in a linked database. One aspect provides an objective ranking based on
the relationship between documents. Another aspect of the invention is directed to 
a technique for ranking documents within a database whose content has a large 
variation in quality and importance. Another aspect of the present invention is to 
provide a document ranking method that is scalable and can be applied to extremely 
large databases such as the world wide web. Additional aspects of the invention 
will become apparent in view of the following description and associated figures. 

Brief Summary Text (12): 

One aspect of the present invention is directed to taking advantage of the linked 
structure of a database to assign a rank to each document in the database, where 
the document rank is a measure of the importance of a document. Rather than 
determining relevance only from the intrinsic content of a document, or from the 
anchor text of backlinks to the document, a method consistent with the invention 
determines importance from the extrinsic relationships between documents. 
Intuitively, a document should be important (regardless of its content) if it is 
highly cited by other documents. Not all citations, however, are necessarily of 
equal significance. A citation from an important document is more important than a 
citation from a relatively unimportant document. Thus, the importance of a page, 
and hence the rank assigned to it, should depend not just on the number of 
citations it has, but on the importance of the citing documents as well. This 
implies a recursive definition of rank: the rank of a document is a function of the 
ranks of the documents which cite it. The ranks of documents may be calculated by 
an iterative procedure on a linked database.

Brief Summary Text (14) : 

In one aspect of the invention, a computer implemented method is provided for 
scoring linked database documents. The method comprises the steps of: 

Detailed Description Text (3):

A linked database (i.e. any database of documents containing mutual citations, such 
as the world wide web or other hypermedia archive, a dictionary or thesaurus, and a 
database of academic articles, patents, or court cases) can be represented as a 
directed graph of N nodes, where each node corresponds to a web page document and 
where the directed connections between nodes correspond to links from one document 
to another. A given node has a set of forward links that connect it to children 
nodes, and a set of backward links that connect it to parent nodes. FIG. 1 shows a 
typical relationship between three hypertext documents A, B, and C. As shown in 
this particular figure, the first links in documents B and C are pointers to 
document A. In this case we say that B and C are backlinks of A, and that A is a 
forward link of B and of C. Documents B and C also have other forward links to 
documents that are not shown. 

Detailed Description Text (6) : 

According to one embodiment of the present method of ranking, the backlinks from 
different pages are weighted differently and the number of links on each page is 
normalized. More precisely, the rank of a page A is defined according to the 
present invention as ##EQU1## 

Detailed Description Text (8) : 

The ranks form a probability distribution over web pages, so that the sum of ranks 
over all web pages is unity. The rank of a page can be interpreted as the 
probability that a surfer will be at the page after following a large number of 
forward links. The constant α in the formula is interpreted as the
probability that the web surfer will jump randomly to any web page instead of
following a forward link. The page ranks for all the pages can be calculated using
a simple iterative algorithm, and correspond to the principal eigenvector of the
normalized link matrix of the web, as will be discussed in more detail below.
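
The equation behind the ##EQU1## placeholder above is not reproduced in this text. As a hedged reconstruction from the surrounding definitions (the ranks sum to unity, α is the random-jump probability, and each backlink is normalized by the number of links on the citing page), the rank is commonly written in the form below. The symbols B_1, ..., B_n (the pages that link to A), |B_i| (the number of forward links on B_i), and N (the total number of pages) are assumed notation, not recovered from this record.

```latex
r(A) = \frac{\alpha}{N} + (1 - \alpha)\left( \frac{r(B_1)}{|B_1|} + \cdots + \frac{r(B_n)}{|B_n|} \right)
```

Under this form, summing over all pages gives \sum r = \alpha + (1 - \alpha) \sum r whenever every page has at least one forward link, which is consistent with the statement that the ranks sum to unity.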

Detailed Description Text (9): 

In order to illustrate the present method of ranking, consider the simple web of 
three documents shown in FIG. 2. For simplicity of illustration, we assume in this 
example that r=0. Document A has a single backlink to document C, and this is the
only forward link of document C, so 

Detailed Description Text (13):

Document C has two backlinks. One backlink is to document B, and this is the only
forward link of document B. The other backlink is to document A via the other of
the two forward links from A. Thus

Detailed Description Text (15) : 

In this simple illustrative case we can see by inspection that r(A)=0.4, r(B)=0.2, 
and r(C)=0.4. Although a typical value for α is approximately 0.1, if for simplicity
we set α=0.5 (which corresponds to a 50% chance that a surfer will randomly
jump to one of the three pages rather than following a forward link), then the
mathematical relationships between the ranks become more complicated. In 
particular, we then have 
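
The FIG. 2 example can be checked numerically with a minimal Python sketch. Several things here are assumptions rather than text from this record: the link structure (A links to B and C, B links to C, C links to A) is inferred from the description of the backlinks, the update rule is the commonly published form of the rank formula, and the name `iterate_ranks` is hypothetical.

```python
def iterate_ranks(forward_links, alpha, iterations=100):
    """Repeatedly apply the rank update, starting from a uniform distribution.

    forward_links maps each page to the pages it links to; every page is
    assumed to have at least one forward link.
    """
    pages = sorted(forward_links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share alpha / N.
        updated = {page: alpha / n for page in pages}
        # Each page passes (1 - alpha) of its rank, split evenly over its links.
        for source, targets in forward_links.items():
            share = (1.0 - alpha) * ranks[source] / len(targets)
            for target in targets:
                updated[target] += share
        ranks = updated
    return ranks


# Assumed FIG. 2 structure: A -> B, A -> C, B -> C, C -> A.
fig2 = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
r0 = iterate_ranks(fig2, alpha=0.0)
r5 = iterate_ranks(fig2, alpha=0.5)
print(r0)
print(r5)
```

With a jump probability of zero the sketch settles on the values stated above, r(A)=0.4, r(B)=0.2, and r(C)=0.4. With a jump probability of 0.5 it settles on 14/39, 10/39, and 15/39 for A, B, and C respectively.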

Detailed Description Text (21) : 

The iteration process can be understood as a steady-state probability distribution 
calculated from a model of a random surfer. This model is mathematically equivalent 
to the explanation described above, but provides a more direct and concise 
characterization of the procedure. The model includes (a) an initial N-dimensional 
probability distribution vector p_0 where each component p_0[i] gives the
initial probability that a random surfer will start at a node i, and (b) an
N×N transition probability matrix A where each component A[i][j] gives the
probability that the surfer will move from node i to node j. The probability
distribution of the graph after the surfer follows one link is p_1 = A p_0,
and after two links the probability distribution is p_2 = A p_1 = A^2 p_0.
Assuming this iteration converges, it will converge to a steady-state
probability ##EQU2## 

Detailed Description Text (22) : 

which is a dominant eigenvector of A. The iteration circulates the probability 
through the linked nodes like energy flows through a circuit and accumulates in 
important places. Because pages with no links occur in significant numbers and 
bleed off energy, they cause some complication with computing the ranking. This
complication is caused by the fact they can add huge amounts to the "random jump" 
factor. This, in turn, causes loops in the graph to be highly emphasized which is 
not generally a desirable property of the model. In order to address this problem, 
these childless pages can simply be removed from the model during the iterative 
stages, and added back in after the iteration is complete. After the childless 
pages are added back in, however, the same number of iterations that was required 
to remove them should be done to make sure they all receive a value. (Note that in 
order to ensure convergence, the norm of p.sub.i must be made equal to 1 after each 
iteration.) An alternate method to control the contribution of the childless nodes 
is to only estimate the steady state by iterating a small number of times. 
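
The removal of childless pages described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `strip_childless` is hypothetical, every link target is assumed to appear as a key in the input, and the sketch only covers the removal stage and the pass count, not the step that adds the pages back.

```python
def strip_childless(forward_links):
    """Repeatedly drop pages with no forward links.

    Removing a childless page can leave its parents childless, so the
    removal is repeated until none remain. Returns the pruned graph and
    the number of removal passes that were needed.
    """
    graph = {page: set(targets) for page, targets in forward_links.items()}
    passes = 0
    while True:
        childless = {page for page, targets in graph.items() if not targets}
        if not childless:
            return graph, passes
        graph = {
            page: targets - childless
            for page, targets in graph.items()
            if page not in childless
        }
        passes += 1


links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": []}
pruned, passes = strip_childless(links)
print(pruned, passes)
```

In this example D is removed in the first pass, which leaves C childless, so C is removed in the second pass. The pass count is the figure the text refers to when it says the same number of iterations should be run after the pages are added back.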

Detailed Description Text (23) : 

The rank r[i] of a node i can then be defined as a function of this steady-state 
probability distribution. For example, the rank can be defined simply by
r[i] = p_∞[i]. This method of calculating rank is mathematically equivalent to the
iterative method described first. Those skilled in the art will appreciate that 
this same method can be characterized in various different ways that are 
mathematically equivalent. Such characterizations are obviously within the scope of 
the present invention. Because the rank of various different documents can vary by 
orders of magnitude, it is convenient to define a logarithmic rank ##EQU3## 

Detailed Description Text (24) : 

which assigns a rank of 0 to the lowest ranked node and increases by 1 for each 
order of magnitude in importance higher than the lowest ranked node.
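
The logarithmic rank equation (##EQU3##) is not reproduced in this record. One form that satisfies both stated properties (zero for the lowest ranked node, plus one per order of magnitude) is the base-10 logarithm of each rank divided by the minimum rank. The sketch below uses that form as an assumption; the name `log_rank` is hypothetical.

```python
import math


def log_rank(ranks):
    """Map positive ranks to a base-10 logarithmic scale relative to the minimum."""
    lowest = min(ranks.values())
    return {node: math.log10(rank / lowest) for node, rank in ranks.items()}


scaled = log_rank({"x": 0.001, "y": 0.1, "z": 0.01})
print(scaled)
```

Here node x is the lowest ranked and maps to 0, while y, which is two orders of magnitude more important, maps to approximately 2.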

Detailed Description Text (25):

FIG. 3 shows one embodiment of a computer implemented method for calculating an
importance rank for N linked nodes of a linked database. At a step 101, an initial
N-dimensional vector p_0 is selected. An approximation p_n to a steady-state
probability p_∞ in accordance with the equation p_n = A^n p_0 is computed at a
step 103. Matrix A can be an N×N transition
probability matrix having elements A[i][j] representing a probability of moving
from node i to node j. At a step 105, a rank r[k] for node k from a k-th
component of p_n is determined.

Detailed Description Text (26) : 

In one particular embodiment, a finite number of iterations are performed to 
approximate p_∞. The initial distribution can be selected to be uniform or
non-uniform. A uniform distribution would set each component of p_0 equal to
1/N. A non-uniform distribution, for example, can divide the initial probability
among a few nodes which are known a priori to have relatively large importance.
This non-uniform distribution decreases the number of iterations required to obtain
a close approximation to p_∞ and also is one way to reduce the effect of
artificially inflating relevance by adding unrelated terms. 

Detailed Description Text (28) : 

where 1 is an N×N matrix consisting of all 1s, α is the probability
that a surfer will jump randomly to any one of the N nodes, and B is a matrix whose 
elements B[i][j] are given by ##EQU5## 

Detailed Description Text (29) : 

where n_i is the total number of forward links from node i. The (1-α)
factor acts as a damping factor that limits the extent to which a document's rank
can be inherited by children documents. This models the fact that users typically
jump to a different place in the web after following a few links. The value
of α is typically around 15%. Including this damping is important when many
iterations are used to calculate the rank so that there is no artificial
concentration of rank importance within loops of the web. Alternatively, one may
set α=0 and only iterate a few times in the calculation.
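
The matrix form can be sketched in Python as follows. The equations behind ##EQU5## and the one that the phrase "where 1 is an N×N matrix" refers to are not reproduced in this record, so the sketch assumes A = (α/N)·1 + (1-α)·B with B[i][j] = 1/n_i when node i links to node j and 0 otherwise. The names `transition_matrix` and `step` are hypothetical, and every node is assumed to have at least one forward link and no duplicate links.

```python
def transition_matrix(forward_links, alpha):
    """Build A[i][j], the assumed probability of moving from node i to node j.

    forward_links[i] lists the node indices that node i links to.
    """
    n = len(forward_links)
    matrix = []
    for targets in forward_links:
        row = [alpha / n] * n                          # random-jump term
        for j in targets:
            row[j] += (1.0 - alpha) / len(targets)     # damped link term
        matrix.append(row)
    return matrix


def step(p, matrix):
    """Advance the distribution by one move: new p[j] = sum over i of p[i] * A[i][j]."""
    n = len(p)
    return [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]


links = [[1, 2], [2], [0]]          # node 0 -> 1, 2; node 1 -> 2; node 2 -> 0
A = transition_matrix(links, alpha=0.15)
p = [1.0 / 3] * 3                   # uniform initial distribution
for _ in range(50):
    p = step(p, A)
print(p)
```

Because A[i][j] is indexed from source i to destination j, the update sums over the first index. Writing the same update as a matrix-vector product with a column vector would use the transpose of this matrix.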

Detailed Description Text (30) : 

Consistent with the present invention, there are several ways that this method can 
be adapted or altered for various purposes. As already mentioned above, rather than 
including the random linking probability α equally among all nodes, it can be
divided in various ways among all the sites by changing the 1 matrix to another
matrix. For example, it could be distributed so that a random jump takes the surfer
to one of a few nodes that have a high importance, and will not take the surfer to
any of the other nodes. This can be very effective in preventing deceptively tagged
documents from receiving artificially inflated relevance. Alternatively, the random 
linking probability could be distributed so that random jumps do not happen from 
high importance nodes, and only happen from other nodes. This distribution would 
model a surfer who is more likely to make random jumps from unimportant sites and 
follow forward links from important sites. A modification to avoid drawing 
unwarranted attention to pages with artificially inflated relevance is to ignore 
local links between documents and only consider links between separate domains. 
Because the links from other sites to the document are not directly under the 
control of a typical web site designer, it is then difficult for the designer to 
artificially inflate the ranking. A simpler approach is to weight links from pages 
contained on the same web server less than links from other servers. Also, in 
addition to servers, internet domains and any general measure of the distance 
between links could be used to determine such a weighting.

Detailed Description Text (32) : 

Links can also be weighted by their relative importance within a document. For 
example, highly visible links that are near the top of a document can be given more 
weight. Also, links that are in large fonts or emphasized in other ways can be
given more weight. In this way, the model better approximates human usage and
authors' intentions. In many cases it is appropriate to assign higher value to 
links coming from pages that have been modified recently since such information is 
less likely to be obsolete. 

Detailed Description Text (38) : 

Another important application and embodiment of the present invention is directed 
to enhancing the quality of results from web search engines. In this application of 
the present invention, a ranking method according to the invention is integrated 
into a web search engine to produce results far superior to existing methods in 
quality and performance. A search engine employing a ranking method of the present 
invention provides automation while producing results comparable to a human 
maintained categorized system. In this approach, a web crawler explores the web and 
creates an index of the web content, as well as a directed graph of nodes 
corresponding to the structure of hyperlinks. The nodes of the graph (i.e. pages of 
the web) are then ranked according to importance as described above in connection 
with various exemplary embodiments of the present invention. 

CLAIMS : 

2. The method of claim 1, wherein the assigning includes:

identifying a weighting factor for each of the linking documents, the weighting 
factor being dependent on the number of links to the one or more linking documents, 
and 

adjusting the score of each of the one or more linking documents based on the
identified weighting factor. 

3. The method of claim 1, wherein the assigning includes: 

identifying a weighting factor for each of the linking documents, the weighting 
factor being dependent on an estimation of a probability that a linking document 
will be accessed, and 

adjusting the score of each of the one or more linking documents based on the 
identified weighting factor. 

4. The method of claim 1, wherein the assigning includes: 

identifying a weighting factor for each of the linking documents, the weighting 
factor being dependent on the URL, host, domain, author, institution, or last 
update time of the one or more linking documents, and 

adjusting the score of each of the one or more linking documents based on the 
identified weighting factor. 

5. The method of claim 1, wherein the assigning includes:

identifying a weighting factor for each of the linking documents, the weighting 
factor being dependent on whether the one or more linking documents are selected 
documents or roots, and 

adjusting the score of each of the one or more linking documents based on the 
identified weighting factor. 

6. The method of claim 1, wherein the assigning includes: 

identifying a weighting factor for each of the linking documents, the weighting 
factor being dependent on the importance, visibility or textual emphasis of the 
links in the one or more linking documents, and

adjusting the score of each of the one or more linking documents based on the 
identified weighting factor. 

7. The method of claim 1, wherein the assigning includes: 

identifying a weighting factor for each of the linking documents, the weighting 
factor being dependent on a particular user's preferences, the rate at which users 
access the one or more linking documents, or the importance of the one or more 
linking documents, and 

adjusting the score of each of the one or more linking documents based on the 
identified weighting factor. 

22. The method of claim 1, wherein the assigning a score includes: 

associating one or more backlinks with each of the linked documents, each of the 
backlinks corresponding to one of the linking documents that links to the linked 
document, 

assigning a weight to each of the backlinks, and 

determining a score for each of the linked documents based on a number of backlinks 
for the linked document and the weights assigned to the backlinks. 

24. The method of claim 22, wherein the assigning a weight includes:

assigning different weights to at least some of the backlinks associated with at 
least one of the linked documents. 

25. The method of claim 1, wherein the assigning a score includes: 

associating one or more backlinks with each of the linked documents, each of the 
backlinks corresponding to one of the linking documents that links to the linked 
document, 

assigning a weight to each of the backlinks, and 

determining a score for each of the linked documents based on a sum of the weights 
assigned to the backlinks associated with the linked document. 

26. The method of claim 25, wherein the weights assigned to each of the backlinks 
are independent of text of the corresponding linking documents. 

File: PGPB



Sep 30, 2004 



DOCUMENT-IDENTIFIER: US 20040193636 A1

TITLE: Method for identifying related pages in a hyperlinked database 



Abstract Paragraph : 

A method is described for identifying related pages among a plurality of pages in a 
linked database such as the World Wide Web. An initial page is selected from the 
plurality of pages. Pages linked to the initial page are represented as a graph in 
a memory. The pages represented in the graph are scored on content, and a set of 
pages is selected, the selected set of pages having scores greater than a first 
predetermined threshold. The selected set of pages is scored on connectivity, and a 
subset of the set of pages that have scores greater than a second predetermined 
threshold are selected as related pages. 

Summary of Invention Paragraph : 

[0006] The vicinity of a Web page is defined by the hyperlinks that connect the 
page to others. A Web page can point to other pages, and the page can be pointed to 
by other pages. Close pages are directly linked, farther pages are indirectly 
linked via intermediate pages. This connectivity can be expressed as a graph where 
nodes represent the pages, and the directed edges represent the links. The vicinity 
of all the pages in the result set, up to a certain distance, is called the
neighborhood graph. 

Summary of Invention Paragraph : 

[0011] In U.S. patent application Ser. No. 09/058,577 "Method for Ranking Documents 
in a Hyperlinked Environment using Connectivity and Selective Content Analysis" 
filed by Bharat et al. on Apr. 9, 1998, a method is described which performs
content analysis on only a small subset of the pages in the neighborhood graph to
determine relevance weights, and pages with low relevance weights are pruned from
the graph. Then, the pruned graph is ranked according to a connectivity analysis.
This method still requires the result set of a query to form a query topic. 

Summary of Invention Paragraph : 

[0012] Therefore, there is a need for a method for identifying related pages in a 
linked database that does not require a query and the fetching of many unrelated
pages.

Summary of Invention Paragraph : 

[0013] Provided is a method for identifying related pages among a plurality of 
pages in a linked database such as the World Wide Web. An initial page is selected 
from the plurality of pages by specifying the URL of the page or clicking on the 
page using a Web browser in a convenient manner. 

Detail Description Paragraph : 

[0028] We use the initial page 201 to construct 210 a neighborhood graph (n-graph) 
211 in a memory. Nodes 212 in the graph represent the initial selected page 201 as 
well as other closely linked pages, as described below. The edges 213 denote the 
hyperlinks between pages. The "size" of the graph is determined by K which can be 
preset or adjusted dynamically as the graph is constructed. The idea is that the
graph needs to represent a meaningful number of pages.

Detail Description Paragraph :

[0029] During the construction of the neighborhood graph 211, the direction of 
links is considered as a way of pruning the graph. In the preferred implementation, 
with K=2, our method only includes nodes at distance 2 that are reachable by going
one link backwards ("B"), pages reachable by going one link forwards ("F"), pages
reachable by going one link backwards followed by one link forward ("BF") and those
reachable by going one link forwards and one link backwards ("FB"). This eliminates
nodes that are reachable only by going forward two links ("FF") or backwards two
links ("BB").
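
The direction-based selection in paragraph [0029] can be sketched in Python. This is an illustration only: the names `invert` and `neighborhood` and the sample page labels are hypothetical, and the graph is assumed to be given as a mapping from each page to the pages it links to.

```python
from collections import defaultdict


def invert(forward):
    """Derive backward links (page -> pages linking to it) from forward links."""
    backward = defaultdict(set)
    for source, targets in forward.items():
        for target in targets:
            backward[target].add(source)
    return backward


def neighborhood(start, forward):
    """Collect pages reachable from start along B, F, BF, or FB paths."""
    backward = invert(forward)
    back = set(backward.get(start, ()))                          # "B"
    fwd = set(forward.get(start, ()))                            # "F"
    bf = {q for p in back for q in forward.get(p, ())}           # "BF"
    fb = {q for p in fwd for q in backward.get(p, ())}           # "FB"
    return (back | fwd | bf | fb) - {start}


# G -> P -> S -> C -> Z, plus P -> T and Q -> C.
forward = {"G": ["P"], "P": ["S", "T"], "S": ["C"], "Q": ["C"], "C": ["Z"]}
related = neighborhood("S", forward)
print(sorted(related))
```

For the start page S, the sketch keeps P (B), C (F), T (BF), and Q (FB), and leaves out Z, which is reachable only by FF, and G, which is reachable only by BB.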

Detail Description Paragraph : 

[0030] To eliminate some unrelated nodes from the neighborhood graph 211, our 
method relies on a list 299 of "stop" URLs. Stop URLs are URLs that are so popular 
that they are frequently referenced from many, many pages, such as, for instance 
URLs that refer to popular search engines. An example is "www.altavista.com." These 
"stop" nodes are very general purpose and so are generally not related to the 
specific topic of the selected page 201, and consequently serve no purpose in the 
neighborhood graph. Our method checks each URL against the stop list 299 during the
neighborhood graph construction, and eliminates the node and all incoming and 
outgoing edges if a URL is found on the stop list 299. 

Detail Description Paragraph : 

[0031] In some cases, the neighborhood graph becomes too large. For example, highly 
popular pages are often pointed to by many thousands of pages and including all 
such pages in the neighborhood graph is impractical. Similarly, some pages contain 
thousands of outgoing links, which also cause the graph to become too large. Our 
method filters the incoming or outgoing edges by choosing only a fixed number M of 
them. In our preferred implementation, M is 50. In the case that the page was 
reached by a backwards link L, and the page has more than M outgoing links, our 
method chooses the M links that surround the link L on the page. 
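
The selection of the M links that surround the link L can be sketched as a window over the page's outgoing links in document order. The text does not say how the window is positioned, so centering it on L and clamping it at the ends of the list are assumptions; the name `links_around` is hypothetical.

```python
def links_around(outgoing, link, m):
    """Return up to m outgoing links forming a window around `link`.

    outgoing is the page's link list in document order and must contain
    `link`; the window is shifted inward when `link` is near either end.
    """
    if len(outgoing) <= m:
        return list(outgoing)
    position = outgoing.index(link)
    start = max(0, min(position - m // 2, len(outgoing) - m))
    return outgoing[start:start + m]


page_links = list(range(10))
middle = links_around(page_links, 5, 4)
first = links_around(page_links, 0, 4)
last = links_around(page_links, 9, 4)
print(middle, first, last)
```

The window always has exactly M entries when the page has more than M links, and it always contains L.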

Detail Description Paragraph : 

[0033] In some cases, pages will have identical content, or nearly identical 
contents. This can happen when pages were copied, for example. In such cases, we 
want to include only one such page in our neighborhood graph, since the presence of 
multiple copies of a page will tend to artificially increase the importance of any 
pages that the copies point to. We collapse duplicate pages to a single node in the 
neighborhood graph. There are several ways that one could identify duplicate pages. 

Detail Description Paragraph : 

Relevancy Scoring of Nodes in the Neighborhood Graph 
Detail Description Paragraph : 

[0037] A vector matching operation based on cosine of the angle between vectors is
used to produce scores 203 that measure similarity. Please see, Salton et al.,
"Term-Weighting Approaches in Automatic Text Retrieval," Information Processing and
Management, 24(5), 513-23, 1988. A probabilistic model is described by Croft et al.
in "Using Probabilistic Models of Document Retrieval without Relevance Feedback,"
Documentation, 35(4), 285-94, 1979. For a survey of ranking techniques in
Information Retrieval see Frakes et al., "Information Retrieval: Data Structures &
Algorithms," Chapter 14, "Ranking Algorithms," Prentice-Hall, N.J., 1992.
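
The cosine-based vector match can be sketched as follows. Representing term vectors as dictionaries from term to weight, and the name `cosine`, are assumptions made for illustration; the term-weighting scheme itself is outside this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors (term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                      # an empty vector matches nothing
    return dot / (norm_u * norm_v)


topic = {"rank": 1.0, "link": 1.0}
page = {"rank": 1.0}
unrelated = {"recipe": 1.0}
print(cosine(topic, page), cosine(topic, unrelated))
```

A score near 1 means the page's terms point in the same direction as the topic vector, and a score of 0 means the two share no terms.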

Detail Description Paragraph : 

[0038] Our topic vector can be determined as the term vector of the initial page 
201, or as a vector sum of the term vector of the initial selected page and some 
function of the term vectors of all the pages presented in the neighborhood graph 
211. One such function could simply weight the term vectors of each of the pages 
equally, while another more complex function would give more weight to the term 
vectors of pages that are at a smaller distance K from the selected page 201. 
Scoring 220 results in a scored graph 215.

Detail Description Paragraph :

Pruning Nodes in the Scored Neighborhood Graph 
Detail Description Paragraph : 

[0039] After the graph has been scored, the scored graph 215 is "pruned" 230 to 
produce a pruned graph 216. Here, pruning means removing those nodes and links from 
the graph that are not "similar." There are a variety of approaches which can be 
used as the threshold for pruning, including median score, absolute threshold, or a 
slope-based approach. 

Detail Description Paragraph : 

[0042] One algorithm which performs this scoring is the Kleinberg algorithm 
mentioned previously. This algorithm works by iteratively computing two scores for 
each node in the graph: a hub score (HS) 241 and an authority score 242. The hub
score 241 estimates good hub pages, for example, a page such as a directory that
points to many other relevant pages. The authority score 242 estimates good 
authority pages, for example, a page that has relevant information. 
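
The iterative hub and authority computation can be sketched in Python. The update rules and normalization below follow the commonly published form of Kleinberg's algorithm and are not spelled out in this record, so treat them as an assumption; the name `hits` and the sample edges are hypothetical.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) scores for a list of (source, target) edges."""
    nodes = {node for edge in edges for node in edge}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages pointing to it.
        auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        auth = _normalize(auth)
        # A page's hub score is the sum of the authority scores of pages it points to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = _normalize(new_hub)
    return hub, auth


hub, auth = hits([("d", "x"), ("d", "y"), ("e", "x")])
print(hub, auth)
```

In this small graph, d points to more pages than e and so gets the higher hub score, while x is pointed to by both and so gets the higher authority score.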

Detail Description Paragraph : 

[0045] If a single node has dominated the computation as a hub node, that is, 
exerted "undue influence", then it is sometimes beneficial to remove that node from 
the neighborhood graph in optional step 250, and repeat the scoring phase 240 on 
the graph with the node removed. One way of detecting when this undue influence has 
been exerted is when a single node has a large fraction of the total hub scores of 
all the nodes (e.g., more than 95% of the total hub scores is attributed to a 
single node). Another means determines if the node with the highest hub score has
more than three times the hub score of the next highest hub score. Other means of
determining undue influence are possible. 
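
The two tests for undue influence described in paragraph [0045] can be sketched as follows. The thresholds are the example values from the text (95% of the total, three times the runner-up); the name `dominant_hub` and the choice to return the offending node or None are assumptions.

```python
def dominant_hub(hub_scores, share=0.95, ratio=3.0):
    """Return the node that dominates the hub scores, or None if there is none."""
    ranked = sorted(hub_scores, key=hub_scores.get, reverse=True)
    if len(ranked) < 2:
        return None
    top, runner_up = ranked[0], ranked[1]
    total = sum(hub_scores.values())
    # Test 1: a single node holds more than `share` of the total hub score.
    if total > 0 and hub_scores[top] > share * total:
        return top
    # Test 2: the top hub score exceeds `ratio` times the next highest.
    if hub_scores[top] > ratio * hub_scores[runner_up]:
        return top
    return None


flagged = dominant_hub({"a": 10.0, "b": 2.0, "c": 1.0})
balanced = dominant_hub({"a": 3.0, "b": 2.0, "c": 1.0})
print(flagged, balanced)
```

When a node is returned, the optional step 250 would remove it from the neighborhood graph and the scoring phase 240 would be repeated on the reduced graph.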

Detail Description Paragraph : 

[0053] Our method differs from Kleinberg's algorithm in the scoring phase in that
we detect cases where a node has exerted "undue influence" on the computation of
hub scores. In this case, we remove the node from the graph and repeat the scoring
computation without this node in the graph. This change tends to produce a more
desirable ordering of related pages where highly rated pages are referred to by
more than one page. Kleinberg's algorithm does not include any such handling of
nodes with undue influence. 

CLAIMS : 

1. A method for identifying related pages from a plurality of pages in a linked
database, comprising the steps of: selecting an initial page from the plurality of
pages; representing the initial page and pages linked to the initial page as a
graph of nodes and edges in a memory; repeatedly scoring the initial page and the
pages linked to the initial page on connectivity of the pages; and selecting a
subset of the pages scored on connectivity that have scores greater than a first
predetermined threshold as the related pages of the linked database.

11. The method of claim 8 wherein the pages represented in the graph as nodes are
linked to the node representing the initial page by a number of edges that is 
determined dynamically. 

14. The method of claim 1 including removing any nodes from the graph that have 
scores higher than a third predetermined threshold. 

16. The method of claim 14 wherein the third predetermined threshold is about three 
times larger than a next highest scoring node.

File: USPT



Sep 28, 2004 



DOCUMENT-IDENTIFIER: US 6799176 B1

TITLE: Method for scoring documents in a linked database 



Abstract Text (1) : 

A method is presented for scoring documents stored in a network. The method 
includes identifying links from linking documents to linked documents in the 
network and determining an importance of the identified links. The method further 
includes weighting the identified links based on the determined importance and 
scoring the linked documents based on the weighted links. 

Brief Summary Text (2) : 

This invention relates generally to techniques for analyzing linked databases. More
particularly, it relates to methods for assigning ranks to nodes in a linked 
database, such as any database of documents containing citations, the world wide 
web or any other hypermedia database. 

Brief Summary Text (10) : 

Various aspects of the present invention provide systems and methods for ranking 
documents in a linked database. One aspect provides an objective ranking based on
the relationship between documents. Another aspect of the invention is directed to 
a technique for ranking documents within a database whose content has a large 
variation in quality and importance. Another aspect of the present invention is to 
provide a document ranking method that is scalable and can be applied to extremely 
large databases such as the world wide web. Additional aspects of the invention 
will become apparent in view of the following description and associated figures. 

Brief Summary Text (11) : 

One aspect of the present invention is directed to taking advantage of the linked 
structure of a database to assign a rank to each document in the database, where 
the document rank is a measure of the importance of a document. Rather than 
determining relevance only from the intrinsic content of a document, or from the 
anchor text of backlinks to the document, a method consistent with the invention 
determines importance from the extrinsic relationships between documents. 
Intuitively, a document should be important (regardless of its content) if it is 
highly cited by other documents. Not all citations, however, are necessarily of 
equal significance. A citation from an important document is more important than a
citation from a relatively unimportant document. Thus, the importance of a page, 
and hence the rank assigned to it, should depend not just on the number of 
citations it has, but on the importance of the citing documents as well. This 
implies a recursive definition of rank: the rank of a document is a function of the 
ranks of the documents which cite it. The ranks of documents may be calculated by
an iterative procedure on a linked database.

Brief Summary Text (13) : 

In one aspect of the invention, a computer implemented method is provided for 
scoring linked documents. The method includes identifying links from linking 
documents to linked documents in the network and determining an importance of the 
identified links. The method further includes weighting the identified links based 
on the determined importance and scoring the linked documents based on the weighted 
links. 

Brief Summary Text (15) :

In accordance with yet another implementation consistent with the present 
invention, a method for scoring documents stored in a network includes traversing the
network to identify links between the documents; identifying a location at which 
each of the documents is stored; weighting the links between documents based on the 
identified locations; and scoring the documents based on the weighted links. 

Brief Summary Text (16) : 

In accordance with a further implementation consistent with the present invention, 
a method for scoring documents stored in a network includes identifying links from 
linking documents to linked documents in the network; determining an importance of 
the identified links; weighting the identified links based on the determined 
importance; and scoring the linked documents based on the weighted links. 

Brief Summary Text (20) : 

In yet another implementation consistent with the present invention, a method of 
organizing linked documents includes: (a) identifying a first linked document; (b) 
identifying links between linking documents and the first linked document; (c) 
assigning a weight to each of the identified links; (d) determining a score for the 
first linked document based on (i) the identified links between the linking 
documents and the first linked document, and (ii) the weights assigned to each of 
the identified links; (e) repeating steps (a)-(d) for a second linked document; and
(f) organizing the first and second linked documents based on the determined
scores.
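
Steps (a) through (f) can be sketched as follows. This is only an illustration: it assumes links arrive as (linking document, linked document, weight) triples and that a document's score is the plain sum of the weights of its incoming links, which the text does not specify. All names and data are hypothetical.

```python
from collections import defaultdict

def score_documents(weighted_links):
    """Sum the weights of the incoming links of each linked document."""
    scores = defaultdict(float)
    for source, target, weight in weighted_links:
        scores[target] += weight
    return dict(scores)

def organize(scores):
    """Order the linked documents from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical data: (linking document, linked document, assigned weight)
links = [("p1", "doc1", 1.0), ("p2", "doc1", 0.5), ("p1", "doc2", 0.25)]
scores = score_documents(links)
ranking = organize(scores)
```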

Detailed Description Text (3) :

A linked database (i.e. any database of documents containing mutual citations, such 
as the world wide web or other hypermedia archive, a dictionary or thesaurus, and a 
database of academic articles, patents, or court cases) can be represented as a 
directed graph of N nodes, where each node corresponds to a web page document and 
where the directed connections between nodes correspond to links from one document 
to another. A given node has a set of forward links that connect it to children 
nodes, and a set of backward links that connect it to parent nodes. FIG. 1 shows a
typical relationship between three hypertext documents A, B, and C. As shown in 
this particular figure, the first links in documents B and C are pointers to 
document A. In this case we say that B and C are backlinks of A, and that A is a 
forward link of B and of C. Documents B and C also have other forward links to 
documents that are not shown. 
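
The relationship between forward links and backlinks can be sketched by storing the forward links of each document and inverting that map. The targets "X" and "Y" stand in for the documents that are not shown and are hypothetical, as are the function and variable names.

```python
from collections import defaultdict

# Forward links as described for FIG. 1: B and C each point to A.
# "X" and "Y" are hypothetical placeholders for the documents not shown.
forward_links = {"A": [], "B": ["A", "X"], "C": ["A", "Y"]}

def build_backlinks(forward):
    """Invert a forward-link map: backlinks[d] lists the documents citing d."""
    backlinks = defaultdict(list)
    for parent, children in forward.items():
        for child in children:
            backlinks[child].append(parent)
    return dict(backlinks)

backlinks = build_backlinks(forward_links)
```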

Detailed Description Text (5) : 

According to one embodiment of the present method of ranking, the backlinks from 
different pages are weighted differently and the number of links on each page is 
normalized. More precisely, the rank of a page A is defined according to the
present invention as ##EQU1## 

Detailed Description Text (7) : 

The ranks form a probability distribution over web pages, so that the sum of ranks 
over all web pages is unity. The rank of a page can be interpreted as the 
probability that a surfer will be at the page after following a large number of 
forward links. The constant .alpha. in the formula is interpreted as the
probability that the web surfer will jump randomly to any web page instead of
following a forward link. The page ranks for all the pages can be calculated using
a simple iterative algorithm, and correspond to the principal eigenvector of the
normalized link matrix of the web, as will be discussed in more detail below. 
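
A minimal sketch of such an iterative calculation follows. The rank equation itself appears only as a placeholder in this excerpt, so the sketch assumes the form suggested by the surrounding description: the rank of a page is alpha/N plus (1 - alpha) times the sum, over its backlink pages, of each backlink page's rank divided by that page's number of forward links. The function name, the default alpha, and the three-page web are illustrative assumptions.

```python
def page_rank(forward, alpha=0.15, iterations=50):
    """Iterate r(d) = alpha/N + (1 - alpha) * sum(r(b)/|b| over backlinks b).

    Assumes every document has at least one forward link, so the ranks
    keep summing to 1 without special handling of childless pages.
    """
    n = len(forward)
    rank = {doc: 1.0 / n for doc in forward}
    for _ in range(iterations):
        new_rank = {doc: alpha / n for doc in forward}
        for parent, children in forward.items():
            # Each parent splits its damped rank equally among its children.
            share = (1.0 - alpha) * rank[parent] / len(children)
            for child in children:
                new_rank[child] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
ranks = page_rank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Under these assumptions the ranks sum to one, and page B, which receives only half of A's rank, ends up lowest.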

Detailed Description Text (8) : 

In order to illustrate the present method of ranking, consider the simple web of 
three documents shown in FIG. 2. For simplicity of illustration, we assume in this 
example that .alpha.=0. Document A has a single backlink to document C, and this is the
only forward link of document C, so

Detailed Description Text (10) : 

Document C has two backlinks. One backlink is to document B, and this is the only 
forward link of document B. The other backlink is to document A via the other of 
the two forward links from A. Thus 

Detailed Description Text (11) : 

In this simple illustrative case we can see by inspection that r(A)=0.4, r(B)=0.2,
and r(C)=0.4. Although a typical value for .alpha. is .about.0.1, if for simplicity
we set .alpha.=0.5 (which corresponds to a 50% chance that a surfer will randomly
jump to one of the three pages rather than following a forward link), then the
mathematical relationships between the ranks become more complicated. In
particular, we then have
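
The three-document example can be checked with exact arithmetic. The sketch below assumes the link structure described in the text (A links to B and C, B links to C, C links to A) and the same assumed rank formula, r(X) = alpha/N + (1 - alpha) times the sum of r(Y)/|Y| over the backlink pages Y of X, with N = 3. The function name is hypothetical.

```python
from fractions import Fraction

def solve_example(alpha):
    """Exact ranks for the web A -> B, A -> C, B -> C, C -> A (N = 3).

    Uses the equations for r(B) and r(C) plus the constraint that the
    ranks sum to 1, which also covers the alpha = 0 case.
    """
    alpha = Fraction(alpha)
    jump = alpha / 3          # alpha / N
    d = 1 - alpha             # damping factor (1 - alpha)
    # r(B) = jump + d * r(A) / 2
    # r(C) = jump + d * (r(A) / 2 + r(B))
    # r(A) + r(B) + r(C) = 1
    r_a = (1 - jump * (2 + d)) / (1 + d / 2 + d * (1 + d) / 2)
    r_b = jump + d * r_a / 2
    r_c = jump + d * (r_a / 2 + r_b)
    return r_a, r_b, r_c

no_jump = solve_example(0)                  # the alpha = 0 case
half_jump = solve_example(Fraction(1, 2))   # the alpha = 0.5 case
```

With alpha = 0 this gives 2/5, 1/5, and 2/5, matching the values 0.4, 0.2, and 0.4 stated above. With alpha = 0.5 the same assumed formula gives 14/39, 10/39, and 15/39.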

Detailed Description Text (15): 

The iteration process can be understood as a steady-state probability distribution 
calculated from a model of a random surfer. This model is mathematically equivalent 
to the explanation described above, but provides a more direct and concise 
characterization of the procedure. The model includes (a) an initial N-dimensional 
probability distribution vector p.sub.0 where each component p.sub.0 [i] gives the
initial probability that a random surfer will start at a node i, and (b) an
N.times.N transition probability matrix A where each component A[i][j] gives the
probability that the surfer will move from node i to node j. The probability
distribution of the graph after the surfer follows one link is p.sub.1 =Ap.sub.0,
and after two links the probability distribution is p.sub.2 =Ap.sub.1 =A.sup.2
p.sub.0. Assuming this iteration converges, it will converge to a steady-state
probability ##EQU2## 

Detailed Description Text (16) : 

which is a dominant eigenvector of A. The iteration circulates the probability 
through the linked nodes like energy flows through a circuit and accumulates in
important places. Because pages with no links occur in significant numbers and 
bleed off energy, they cause some complication with computing the ranking. This 
complication is caused by the fact they can add huge amounts to the "random jump" 
factor. This, in turn, causes loops in the graph to be highly emphasized which is 
not generally a desirable property of the model. In order to address this problem, 
these childless pages can simply be removed from the model during the iterative 
stages, and added back in after the iteration is complete. After the childless 
pages are added back in, however, the same number of iterations that was required 
to remove them should be done to make sure they all receive a value. (Note that in 
order to ensure convergence, the norm of p.sub.i must be made equal to 1 after each 
iteration.) An alternate method to control the contribution of the childless nodes
is to only estimate the steady state by iterating a small number of times. 
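
The iteration with renormalization after each step can be sketched as below. The sketch takes the stated convention that entry [i][j] is the probability of moving from node i to node j, so one step is computed as p_next[j] = sum over i of p[i] * A[i][j]. The example matrix, the value 0.1 for the jump probability, and all names are assumptions, and childless pages are not handled.

```python
def power_iterate(transition, p0, steps=100):
    """Approximate the steady state by repeatedly applying the transition matrix.

    transition[i][j] is the probability of moving from node i to node j,
    so one step computes p_next[j] = sum_i p[i] * transition[i][j].
    """
    p = list(p0)
    n = len(p)
    for _ in range(steps):
        p = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
        total = sum(p)
        p = [x / total for x in p]   # keep the entries of p summing to 1
    return p

# Hypothetical 3-node web (0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0) with a 0.1 jump probability.
B = [[0, 0.5, 0.5], [0, 0, 1], [1, 0, 0]]
alpha = 0.1
A = [[alpha / 3 + (1 - alpha) * b for b in row] for row in B]
steady = power_iterate(A, [1 / 3] * 3)
```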

Detailed Description Text (17) : 

The rank r[i] of a node i can then be defined as a function of this steady-state 
probability distribution. For example, the rank can be defined simply by r[i]
=p.sub..infin. [i]. This method of calculating rank is mathematically equivalent to
the iterative method described first. Those skilled in the art will appreciate that 
this same method can be characterized in various different ways that are 
mathematically equivalent. Such characterizations are obviously within the scope of 
the present invention. Because the rank of various different documents can vary by 
orders of magnitude, it is convenient to define a logarithmic rank ##EQU3## 

Detailed Description Text (18): 

which assigns a rank of 0 to the lowest ranked node and increases by 1 for each 
order of magnitude in importance higher than the lowest ranked node.
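
One way to obtain a rank with these properties is to take the base-10 logarithm of each rank divided by the lowest rank. The equation is only a placeholder in this excerpt, so this form is an assumption inferred from the description; the function name and sample values are hypothetical.

```python
import math

def logarithmic_rank(ranks):
    """Map each rank to log10(rank / lowest rank)."""
    lowest = min(ranks.values())
    return {node: math.log10(r / lowest) for node, r in ranks.items()}

# Hypothetical ranks spanning three orders of magnitude.
log_ranks = logarithmic_rank({"a": 0.0005, "b": 0.005, "c": 0.5})
```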

Detailed Description Text (19) : 

FIG. 3 shows one embodiment of a computer implemented method for calculating an
importance rank for N linked nodes of a linked database. At a step 101, an initial
N-dimensional vector p.sub.0 is selected. An approximation p.sub.n to a steady-state
probability p.sub..infin. in accordance with the equation p.sub.n =A.sup.n
p.sub.0 is computed at a step 103. Matrix A can be an N.times.N transition
probability matrix having elements A[i][j] representing a probability of moving
from node i to node j. At a step 105, a rank r[k] for node k from a k.sup.th
component of p.sub.n is determined.

Detailed Description Text (20) : 

In one particular embodiment, a finite number of iterations are performed to 
approximate p.sub..infin.. The initial distribution can be selected to be uniform
or non-uniform. A uniform distribution would set each component of p.sub.0 equal to
1/N. A non-uniform distribution, for example, can divide the initial probability
among a few nodes which are known a priori to have relatively large importance.
This non-uniform distribution decreases the number of iterations required to obtain
a close approximation to p.sub..infin. and also is one way to reduce the effect of
artificially inflating relevance by adding unrelated terms. 

Detailed Description Text (22) : 

where 1 is an N.times.N matrix consisting of all 1s, .alpha. is the probability
that a surfer will jump randomly to any one of the N nodes, and B is a matrix whose
elements B[i][j] are given by ##EQU4##

Detailed Description Text (23) : 

where n.sub.i is the total number of forward links from node i. The (1-.alpha.)
factor acts as a damping factor that limits the extent to which a document's rank
can be inherited by children documents. This models the fact that users typically
jump to a different place in the web after following a few links. The value
of .alpha. is typically around 15%. Including this damping is important when many
iterations are used to calculate the rank so that there is no artificial
concentration of rank importance within loops of the web. Alternatively, one may
set .alpha.=0 and only iterate a few times in the calculation.
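
A transition matrix of this kind can be assembled from forward links as sketched below. The sketch assumes A = (alpha/N) times the all-ones matrix plus (1 - alpha) times B, with B[i][j] = 1/n_i when node i links to node j and 0 otherwise; the defining equations appear only as placeholders in this excerpt, so this form is inferred from the surrounding text. Names and the example web are hypothetical.

```python
def transition_matrix(forward, alpha=0.15):
    """Build A = (alpha/N) * ones + (1 - alpha) * B from forward links.

    B[i][j] = 1/n_i if node i links to node j, else 0, where n_i is the
    number of forward links of node i. Assumes every node has a link.
    """
    nodes = sorted(forward)
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    matrix = [[alpha / n] * n for _ in nodes]
    for node, children in forward.items():
        for child in children:
            matrix[index[node]][index[child]] += (1 - alpha) / len(children)
    return nodes, matrix

nodes, A = transition_matrix({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Each row of the resulting matrix sums to one, since a surfer at any node either jumps at random or follows one of that node's forward links.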

Detailed Description Text (24) :

Consistent with the present invention, there are several ways that this method can 
be adapted or altered for various purposes. As already mentioned above, rather than 
including the random linking probability .alpha. equally among all nodes, it can be
divided in various ways among all the sites by changing the all-ones matrix 1 to another
matrix. For example, it could be distributed so that a random jump takes the surfer
to one of a few nodes that have a high importance, and will not take the surfer to
any of the other nodes. This can be very effective in preventing deceptively tagged
documents from receiving artificially inflated relevance. Alternatively, the random 
linking probability could be distributed so that random jumps do not happen from 
high importance nodes, and only happen from other nodes. This distribution would 
model a surfer who is more likely to make random jumps from unimportant sites and 
follow forward links from important sites. A modification to avoid drawing 
unwarranted attention to pages with artificially inflated relevance is to ignore 
local links between documents and only consider links between separate domains. 
Because the links from other sites to the document are not directly under the 
control of a typical web site designer, it is then difficult for the designer to 
artificially inflate the ranking. A simpler approach is to weight links from pages 
contained on the same web server less than links from other servers. Also, in 
addition to servers, internet domains and any general measure of the distance 
between links could be used to determine such a weighting.
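
A minimal sketch of weighting same-server links less than links from other servers, comparing the host part of each URL. The weight 0.1 is an arbitrary illustration and is not taken from the text; the function name and URLs are hypothetical.

```python
from urllib.parse import urlparse

def link_weight(source_url, target_url, same_host_weight=0.1):
    """Weight a link lower when both ends are on the same host.

    The default 0.1 is an arbitrary illustration. Setting it to 0
    ignores local links entirely.
    """
    same_host = urlparse(source_url).hostname == urlparse(target_url).hostname
    return same_host_weight if same_host else 1.0

w_local = link_weight("http://example.com/a", "http://example.com/b")
w_remote = link_weight("http://example.com/a", "http://example.org/b")
```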

Detailed Description Text (26) : 

Links can also be weighted by their relative importance within a document. For 
example, highly visible links that are near the top of a document can be given more 
weight. Also, links that are in large fonts or emphasized in other ways can be
given more weight. In this way, the model better approximates human usage and
authors' intentions. In many cases it is appropriate to assign higher value to
links coming from pages that have been modified recently since such information is 
less likely to be obsolete. 

Detailed Description Text (31) : 

Another important application and embodiment of the present invention is directed 
to enhancing the quality of results from web search engines. In this application of
the present invention, a ranking method according to the invention is integrated 
into a web search engine to produce results far superior to existing methods in 
quality and performance. A search engine employing a ranking method of the present 
invention provides automation while producing results comparable to a human 
maintained categorized system. In this approach, a web crawler explores the web and 
creates an index of the web content, as well as a directed graph of nodes 
corresponding to the structure of hyperlinks. The nodes of the graph (i.e., pages 
of the web) are then ranked according to importance as described above in 
connection with various exemplary embodiments of the present invention. 

Other Reference Publication (2) : 

Copy of claims of U.S. Serial No. 09/895,174, filed on July 2, 2001; Lawrence Page; 
Method for Node Ranking in a Linked Database ; 8 pages. 

L6: Entry 10 of 11 File: USPT Oct 24, 2000

DOCUMENT- IDENTIFIER : US 6138113 A

TITLE: Method for identifying near duplicate pages in a hyperlinked database




Abstract Text (1) : 

A method is described for identifying pages that are near duplicates in a linked
database. In the linked database, pages can have incoming links and outgoing links.
Two pages are selected, a first page and a second page. For each selected page, the 
number of outgoing links is determined. The two pages are marked as near duplicates 
based on the number of common outgoing links for the two pages. 

Brief Summary Text (8) : 

The vicinity of a Web page is defined by the hyperlinks that connect the page to 
others. A Web page can point to other pages, and the page can be pointed to by 
other pages. Close pages are directly linked, farther pages are indirectly linked. 
This connectivity can be expressed as a graph where nodes represent the pages, and 
the directed edges represent the links. The vicinity of all the pages in the result 
set is called the neighborhood graph. 

Brief Summary Text (11) : 

In U.S. patent application Ser. No. 09/058,577 "Method for Ranking Documents in a 
Hyperlinked Environment using Connectivity and Selective Content Analysis" filed by
Bharat et al. on Apr. 9, 1998, a method is described which performs content
analysis on only a small subset of the pages in the neighborhood graph to determine
relevance weights, and pages with low relevance weights are pruned from the graph.
Then, the pruned graph is ranked according to a connectivity analysis. This
method still requires the result set of a query to form a query topic. 

Brief Summary Text (14) : 

Provided is a method for identifying near duplicate pages among a plurality of 
pages in a linked database such as the World Wide Web. A first and second page are 
selected for a near duplicate determination. For each page, the number of outgoing 
links is counted. Pages are marked as near duplicates based on the number of common 
outgoing links between the two pages. 
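
A minimal sketch of such a test follows. The text only says the decision is based on the number of common outgoing links; comparing that number with the size of the larger link set, and the 0.8 threshold, are assumptions made for illustration. Names and URLs are hypothetical.

```python
def near_duplicates(links_a, links_b, min_fraction=0.8):
    """Mark two pages as near duplicates if most outgoing links are shared.

    The threshold and the "fraction of the larger link set" criterion
    are illustrative assumptions.
    """
    set_a, set_b = set(links_a), set(links_b)
    if not set_a or not set_b:
        return False
    common = len(set_a & set_b)
    return common / max(len(set_a), len(set_b)) >= min_fraction

page1 = [f"http://example.com/{i}" for i in range(10)]
page2 = page1[:9] + ["http://example.org/x"]                      # 9 of 10 links shared
page3 = page1[:2] + [f"http://example.org/{i}" for i in range(8)]  # 2 of 10 links shared
```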

Detailed Description Text (16) : 

We use the initial page 201 to construct 210 a neighborhood graph (ngraph) 211 in a 
memory. Nodes 212 in the graph represent the initial selected page 201 as well as 
other closely linked pages, as described below. The edges 213 denote the hyperlinks 
between pages. The "size" of the graph is determined by K which can be preset or 
adjusted dynamically as the graph is constructed. The idea is that the graph
needs to represent a meaningful number of pages.

Detailed Description Text (17) : 

During the construction of the neighborhood graph 211, the direction of links is 
considered as a way of pruning the graph. In the preferred implementation, with 
K=2, our method only includes nodes at distance 2 that are reachable by going one 
link backwards ("B"), pages reachable by going one link forwards ("F"), pages 
reachable by going one link backwards followed by one link forward ("BF") and those 
reachable by going one link forwards and one link backwards ("FB"). This eliminates
nodes that are reachable only by going forward two links ("FF") or backwards two
links ("BB").
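
The direction-based selection with K=2 can be sketched as follows, assuming the links are available as a map from each page to the pages it points to. The helper names and the sample link data are hypothetical.

```python
from collections import defaultdict

def invert(forward):
    """Derive backward links (who links to a page) from forward links."""
    backward = defaultdict(list)
    for page, targets in forward.items():
        for target in targets:
            backward[target].append(page)
    return backward

def neighborhood(start, forward):
    """Nodes reachable from start by B, F, BF, or FB paths, plus start itself."""
    backward = invert(forward)
    b = set(backward.get(start, []))
    f = set(forward.get(start, []))
    bf = {q for p in b for q in forward.get(p, [])}
    fb = {q for p in f for q in backward.get(p, [])}
    return {start} | b | f | bf | fb

# Hypothetical link data around a start page "s".
forward = {"s": ["f1"], "b1": ["s", "sib"], "f1": ["ff"], "bb": ["b1"], "co": ["f1"]}
nodes = neighborhood("s", forward)
```

In this data, "ff" is reachable only by two forward links and "bb" only by two backward links, so neither is included.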

Detailed Description Text (18) : 

To eliminate some unrelated nodes from the neighborhood graph 211, our method
relies on a list 299 of "stop" URLs, which are URLs that are so popular that they 
are highly referenced from many, many pages, such as popular search engines. An 
example is "www.altavista.digital.com." These "stop" nodes are very general purpose 
and so are generally not related to the specific topic of the selected page 201, 
and so serve no purpose in the neighborhood graph. Our method checks each URL 
against the stop list 299 during the neighborhood graph construction, and 
eliminates the node and all incoming and outgoing edges if a URL is found on the 
stop list 299. 
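
A minimal sketch of the stop-list check, assuming the graph is held as a list of (source URL, destination URL) edges. The function name and the example.com URLs are hypothetical.

```python
def remove_stop_nodes(edges, stop_urls):
    """Drop every edge that starts or ends at a URL on the stop list."""
    stop = set(stop_urls)
    return [(src, dst) for src, dst in edges if src not in stop and dst not in stop]

stop_list = ["www.altavista.digital.com"]
edges = [
    ("example.com/a", "example.com/b"),
    ("example.com/a", "www.altavista.digital.com"),
    ("www.altavista.digital.com", "example.com/c"),
]
kept = remove_stop_nodes(edges, stop_list)
```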

Detailed Description Text (19) : 

In some cases, the neighborhood graph becomes too large. For example, highly 
popular pages are often pointed to by many thousands of pages and including all 
such pages in the neighborhood graph is impractical. Similarly, some pages contain 
thousands of outgoing links, which also cause the graph to become too large. Our 
method filters the incoming or outgoing edges by choosing only a fixed number M of 
them. In our preferred implementation, M is 50. In the case that the page was 
reached by a backwards link L, and the page has more than M outgoing links, our 
method chooses the M links that surround the link L on the page. 
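
One reading of "surround" is a window of M links roughly centered on the position of L in the page's list of outgoing links. The sketch below assumes that reading; the function name and data are hypothetical.

```python
def links_around(outgoing, link, m=50):
    """Keep at most m outgoing links in a window around the position of `link`.

    Centering the window on the link is an assumed interpretation of
    "surround"; the window is shifted inward near either end of the page.
    """
    if len(outgoing) <= m:
        return list(outgoing)
    pos = outgoing.index(link)
    start = max(0, min(pos - m // 2, len(outgoing) - m))
    return outgoing[start:start + m]

outgoing = [f"u{i}" for i in range(200)]
window = links_around(outgoing, "u100", m=50)
```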

Detailed Description Text (21) : 

In some cases, two pages will have identical contents, or nearly identical 
contents. This can happen when the page was copied, for example. In such cases, we 
want to include only one such page in our neighborhood graph, since the presence of 
multiple copies of a page will tend to artificially increase the importance of any 
pages that they point to. We collapse duplicate pages to a single node in the 
neighborhood graph. There are several ways that one could identify duplicate pages. 



Detailed Description Text (23) : 

Relevancy Scoring of Nodes in the Neighborhood Graph 

Detailed Description Text (25) :

Scoring can be done using well known retrieval techniques. For example, in the 
Salton & Buckley model, the content of the represented pages 211 and the topic 202 
can be regarded as vectors in an n-dimensional vector space, where n corresponds to 
the number of unique terms in the data set. A vector matching operation based on 
cosine of the angle between vectors is used to produce scores 203 that measure
similarity. Please see Salton et al., "Term-Weighting Approaches in Automatic Text
Retrieval," Information Processing and Management, 24(5), 513-23, 1988. A
probabilistic model is described by Croft et al. in "Using Probabilistic Models of
Document Retrieval without Relevance Feedback," Documentation, 35(4), 285-94, 1979.
For a survey of ranking techniques in Information Retrieval, see Frakes et al.,
"Information Retrieval: Data Structures & Algorithms," Chapter 14, "Ranking
Algorithms," Prentice-Hall, N.J., 1992.
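
The cosine-based matching can be sketched with raw term counts as the vector components. This is a simplification: the cited term-weighting approaches would replace the raw counts, and the whitespace tokenization and function name are assumptions.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

same = cosine_similarity("web page ranking", "web page ranking")
unrelated = cosine_similarity("web page", "court case")
```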

Detailed Description Text (26) : 

Our topic vector can be determined as the term vector of the initial page 201, or 
as a vector sum of the term vector of the initial selected page and some function 
of the term vectors of all the pages presented in the neighborhood graph 211. One 
such function could simply weight the term vectors of each of the pages equally, 
while another more complex function would give more weight to the term vectors of 
pages that are at a smaller distance K from the selected page 201. Scoring 220 
results in a scored graph 215. 

Detailed Description Text (27) : 

Pruning Nodes in the Scored Neighborhood Graph 

Detailed Description Text (28) :

After the graph has been scored, the scored graph 215 is "pruned" 230 to produce a 
pruned graph 216. Here, pruning means removing those nodes and links from the graph 
that are not "similar." There are a variety of approaches which can be used as the 
threshold for pruning, including median score, absolute threshold, or a slope-based 
approach. 
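
The median-score and absolute-threshold options can be sketched as follows; the slope-based approach is not shown. Keeping nodes whose score equals the cutoff is an assumption, as are the names and sample scores.

```python
from statistics import median

def prune(scores, threshold=None):
    """Keep nodes scoring at least the cutoff (the median score by default)."""
    cutoff = median(scores.values()) if threshold is None else threshold
    return {node: s for node, s in scores.items() if s >= cutoff}

scores = {"a": 0.9, "b": 0.5, "c": 0.1}
by_median = prune(scores)
by_absolute = prune(scores, threshold=0.8)
```

Links that touch a removed node would also need to be dropped from the graph.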

Detailed Description Text (31) : 

One algorithm which performs this scoring is the Kleinberg algorithm mentioned 
previously. This algorithm works by iteratively computing two scores for each node 
in the graph: a hub score (HS) 241 and an authority score 242. The hub score 241 
estimates good hub pages, for example, a page such as a directory that points to 
many other relevant pages. The authority score 242 estimates good authority pages, 
for example, a page that has relevant information. 
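
A compact sketch of this kind of iteration follows. It uses the usual mutual-reinforcement updates, in which a page's authority score is the sum of the hub scores of the pages pointing to it and a page's hub score is the sum of the authority scores of the pages it points to. The normalization, iteration count, and sample graph are assumptions, not details from this text.

```python
import math

def hits(forward, iterations=50):
    """Iteratively compute hub and authority scores for each node.

    Scores are rescaled to unit Euclidean length after each pass.
    """
    nodes = set(forward) | {q for targets in forward.values() for q in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: 0.0 for n in nodes}
        for p, targets in forward.items():
            for q in targets:
                auth[q] += hub[p]
        hub = {p: sum(auth[q] for q in forward.get(p, [])) for p in nodes}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical graph: a directory page pointing to three pages, one of them also cited elsewhere.
hub, auth = hits({"dir": ["x", "y", "z"], "other": ["x"]})
```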



1. A method for identifying pages that are near duplicates in a linked database, 
the pages in the database having incoming links and outgoing links, comprising the
steps of: 

selecting a first page and a second page; 

determining the outgoing links for the first page and the second page; 

determining the number of outgoing links that are common for the first page and the 
second page; 

marking the first page and the second page as near duplicate pages based on the 
number of common outgoing links. 



CLAIMS : 

ART-UNIT: 2126

PRIMARY-EXAMINER: Courtenay, III; St. John



ABSTRACT : 

In accordance with an exemplary embodiment of the present invention, association 
forming entities are: a) maintained as objects in a like manner to the data objects 
being associated; and b) are themselves partitioned objects comprising two or more 
association fragments, each association fragment being mostly concerned with the 
interfaces to a particular data object participating in the association. In 
accordance with an exemplary embodiment of the present invention, each association 
fragment affiliated with a particular data object is stored in a location that 
enhances the ease of interaction between the association fragment and the data 
object. For example, where a first data object and second data object are 
maintained in data stores at some distance from one another, physically or 
logically, then a first association fragment will be located with or near to the 
first data object, and a second association fragment will be located with or near
the second data object, at least within the same partition. This arrangement may be 
preferable because the volume of interaction between a data object and its 
respective association fragment may far outweigh the interaction needed between the 
two association fragments. This arrangement may also be preferable as the volume of 
interaction between a client application and both the data object and respective 
association fragment may exceed the interaction needed between the two association 
fragments. Some interactions will employ only one of the association fragments with 
the net result being a reduction in communications requirements and an improvement 
in performance. The present invention further provides for defining logical domains 
which are arbitrary and entirely orthogonal to partitions. 

64 Claims, 50 Drawing figures 

ABSTRACT :

A method is described for identifying related pages among a plurality of pages in a 
linked database such as the World Wide Web. An initial page is selected from the 
plurality of pages. Pages linked to the initial page are represented as a graph in 
a memory. The pages represented in the graph are scored on content, and a set of 
pages is selected, the selected set of pages having scores greater than a first 
predetermined threshold. The selected set of pages is scored on connectivity, and a 
subset of the set of pages that have scores greater than a second predetermined 
threshold are selected as related pages. 

54 Claims, 2 Drawing figures 

The integrity of uniform resource locator (URL) references within web sites is
maintained when changes occur in the locations where resources referenced by URLs 
are stored. A Referential Preservation Engine (RPE) maintains a database in which 
the location of web site documents and reference information are stored and updates 
various URL hyperlink references contained in the web pages on the site so that
users can locate documents that have been moved to new storage locations. The RPE 
can also update links to external web sites by communicating with an RPE running on 
each external site. The RPE on the external site keeps track of the movement of 
linked documents on the sites and passes information pertaining to the new location 
of the linked documents to the local site, whereupon the links on the local web 
site pages are updated to reflect the new storage locations. The RPE also can track 
usage of a user's favorite sites and/or documents that are stored in an Internet 
browser and update the URL references for these favorites when the resources they 
are mapped to are moved (or renamed).

ABSTRACT :

A method assigns importance ranks to nodes in a linked database, such as any 
database of documents containing citations, the world wide web or any other 
hypermedia database. The rank assigned to a document is calculated from the ranks 
of documents citing it. In addition, the rank of a document is calculated from a 
constant representing the probability that a browser through the database will 
randomly jump to the document. The method is particularly useful in enhancing the 
performance of search engine results for hypermedia databases, such as the world 
wide web, whose documents have a large variation in quality. 

A method is described for identifying pages that are near duplicates in a linked
database. In the linked database, pages can have incoming links and outgoing links.
Two pages are selected, a first page and a second page. For each selected page, the 
number of outgoing links is determined. The two pages are marked as near duplicates 
based on the number of common outgoing links for the two pages. 